In the Maze of Data Languages
In data languages the positions of strings and trees carry a label from a
finite alphabet and a data value from an infinite alphabet. Extensions of
automata and logics over finite alphabets have been defined to recognize data
languages, both in the string and tree cases. In this paper we describe and
compare the complexity and expressiveness of such models in order to
understand which ones are the best candidates for a notion of regular data
languages.
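To make the setting concrete, here is a minimal sketch (not taken from the paper; the labels, data values, and the property itself are chosen purely for illustration) of a data word and a membership test for a classic data language:

```python
# A data word: a sequence of (label, data) pairs, with labels drawn
# from a finite alphabet and data values from an infinite domain
# (here, arbitrary integers).

def no_repeated_a_data(word):
    """Membership test for a classic data language: no two positions
    labeled 'a' carry the same data value. No classical finite
    automaton over the labels alone can capture this; data-aware
    models such as register automata target exactly this gap."""
    a_values = [d for (label, d) in word if label == "a"]
    return len(a_values) == len(set(a_values))

# All 'a' positions carry distinct data values: accepted.
assert no_repeated_a_data([("a", 1), ("b", 1), ("a", 2), ("a", 3)])
# Two 'a' positions share the data value 1: rejected.
assert not no_repeated_a_data([("a", 1), ("b", 2), ("a", 1)])
```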
Automata Tutor v3
Computer science class enrollments have rapidly risen in the past decade.
With current class sizes, standard approaches to grading and providing
personalized feedback are no longer possible and new techniques become both
feasible and necessary. In this paper, we present the third version of Automata
Tutor, a tool for helping teachers and students in large courses on automata
and formal languages. The second version of Automata Tutor supported automatic
grading and feedback for finite-automata constructions and has already been
used by thousands of users in dozens of countries. This new version of Automata
Tutor supports automated grading and feedback generation for a greatly extended
variety of new problems, including problems that ask students to create regular
expressions, context-free grammars, pushdown automata and Turing machines
corresponding to a given description, and problems about converting between
equivalent models - e.g., from regular expressions to nondeterministic finite
automata. Moreover, for several problems, this new version also enables
teachers and students to automatically generate new problem instances. We also
present the results of a survey run in a class of 950 students; the responses
are strongly positive about the usability and usefulness of the tool.
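As a hedged illustration of what automated grading and feedback can involve (this is not Automata Tutor's actual implementation; the DFA encoding below is an assumption made for the sketch), the following checks a student DFA against a reference DFA and returns a shortest word on which they disagree, which can double as a counterexample in feedback:

```python
from collections import deque

def distinguishing_word(dfa1, dfa2, alphabet):
    """Each DFA is (start, delta, accepting), where delta[(state,
    symbol)] -> state. Breadth-first search over the product automaton
    finds a shortest word accepted by exactly one of the two DFAs, or
    returns None if they are equivalent (full marks)."""
    s1, d1, f1 = dfa1
    s2, d2, f2 = dfa2
    seen = {(s1, s2)}
    queue = deque([(s1, s2, "")])
    while queue:
        q1, q2, word = queue.popleft()
        if (q1 in f1) != (q2 in f2):
            return word  # the two DFAs disagree on this word
        for a in alphabet:
            nxt = (d1[(q1, a)], d2[(q2, a)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt[0], nxt[1], word + a))
    return None

# Reference: even number of 'a's. Student: accepts every word.
reference = (0, {(0, "a"): 1, (1, "a"): 0}, {0})
student = (0, {(0, "a"): 0}, {0})
print(distinguishing_word(student, reference, ["a"]))  # -> "a"
```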
Synthesizing Specifications
Every program should always be accompanied by a specification that describes
important aspects of the code's behavior, but writing good specifications is
often harder than writing the code itself. This paper addresses the problem of
synthesizing specifications automatically. Our method takes as input (i) a set
of function definitions, and (ii) a domain-specific language L in which the
extracted properties are to be expressed. It outputs a set of
properties--expressed in L--that describe the behavior of the given functions.
Each of the produced properties is a best L-property for the signature: there
is no other L-property for the signature that is strictly more precise.
Furthermore, the set is
exhaustive: no more L-properties can be added to it to make the conjunction
more precise.
We implemented our method in a tool, spyro. When given the reference
implementation for a variety of SyGuS and Synquid synthesis benchmarks, spyro
often synthesized properties that matched the original specification
provided in the synthesis benchmark.
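As a purely hypothetical illustration of the input/output contract (not spyro's actual algorithm: spyro verifies properties and compares them for precision, whereas this sketch merely filters candidates by testing, and the DSL here is invented for exposition), consider extracting properties of a list-reversal function:

```python
def reverse(xs):
    return xs[::-1]

# Candidate properties over (input xs, output ys), drawn from a tiny
# hypothetical DSL and encoded as Python predicates.
candidates = {
    "len(ys) == len(xs)":     lambda xs, ys: len(ys) == len(xs),
    "set(ys) == set(xs)":     lambda xs, ys: set(ys) == set(xs),
    "ys == sorted(xs)":       lambda xs, ys: ys == sorted(xs),
    "ys[0] == xs[-1] if xs":  lambda xs, ys: not xs or ys[0] == xs[-1],
}

tests = [[], [1], [2, 1], [3, 1, 2], [1, 1, 2]]
for name, prop in candidates.items():
    holds = all(prop(xs, reverse(xs)) for xs in tests)
    print(("kept   " if holds else "dropped") + ": " + name)
# "ys == sorted(xs)" is dropped (it fails on [3, 1, 2]); the rest
# survive. A sound tool would verify rather than test, and would also
# prune any kept property that another kept property makes redundant.
```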
PECAN: A Deterministic Certified Defense Against Backdoor Attacks
Neural networks are vulnerable to backdoor poisoning attacks, where the
attackers maliciously poison the training set and insert triggers into the test
input to change the prediction of the victim model. Existing defenses for
backdoor attacks either provide no formal guarantees or come with
expensive-to-compute and ineffective probabilistic guarantees. We present
PECAN, an efficient and certified approach for defending against backdoor
attacks. The key insight powering PECAN is to apply off-the-shelf test-time
evasion certification techniques on a set of neural networks trained on
disjoint partitions of the data. We evaluate PECAN on image classification and
malware detection datasets. Our results demonstrate that PECAN can (1)
significantly outperform the state-of-the-art certified backdoor defense, both
in defense strength and efficiency, and (2) on real backdoor attacks, PECAN
can reduce the attack success rate by an order of magnitude compared to a
range of baselines from the literature.
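The following sketch illustrates the partition-and-aggregate idea behind this style of defense in simplified form (it omits PECAN's test-time evasion-certification component, and the sharding and margin details are assumptions rather than the paper's exact construction): because the partitions are disjoint, each poisoned training sample can influence at most one model, so a large enough vote margin certifies the prediction against a bounded number of poisoned samples.

```python
from collections import Counter

def partition(dataset, k):
    """Split the training set into k disjoint shards. Real defenses
    derive the shard from a deterministic hash of the example content
    so an attacker cannot steer poisoned samples; round-robin keeps
    this sketch simple."""
    return [dataset[i::k] for i in range(k)]

def certified_prediction(votes, n_poisoned):
    """votes: per-model predicted labels for one test input, one model
    per shard. Each poisoned sample lands in one shard and can flip at
    most one vote, so the top label is certified whenever its margin
    over the runner-up exceeds 2 * n_poisoned."""
    counts = Counter(votes).most_common()
    top_label, top_count = counts[0]
    runner_up = counts[1][1] if len(counts) > 1 else 0
    return top_label, (top_count - runner_up) > 2 * n_poisoned

votes = ["cat"] * 40 + ["dog"] * 10
print(certified_prediction(votes, n_poisoned=5))   # ('cat', True)
print(certified_prediction(votes, n_poisoned=20))  # ('cat', False)
```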
The Dataset Multiplicity Problem: How Unreliable Data Impacts Predictions
We introduce dataset multiplicity, a way to study how inaccuracies,
uncertainty, and social bias in training datasets impact test-time predictions.
The dataset multiplicity framework asks a counterfactual question of what the
set of resultant models (and associated test-time predictions) would be if we
could somehow access all hypothetical, unbiased versions of the dataset. We
discuss how to use this framework to encapsulate various sources of uncertainty
in datasets' factualness, including systemic social bias, data collection
practices, and noisy labels or features. We show how to exactly analyze the
impacts of dataset multiplicity for a specific model architecture and type of
uncertainty: linear models with label errors. Our empirical analysis shows that
real-world datasets, under reasonable assumptions, contain many test samples
whose predictions are affected by dataset multiplicity. Furthermore, the choice
of domain-specific dataset multiplicity definition determines what samples are
affected, and whether different demographic groups are disparately impacted.
Finally, we discuss implications of dataset multiplicity for machine learning
practice and research, including considerations for when model outcomes should
not be trusted.
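As an illustrative sketch of the linear-models-with-label-errors analysis (simplified; the function and its parameters are assumptions for exposition, not the paper's code), note that a least-squares prediction is linear in the training labels, so its exact range under a bounded number of bounded label errors has a closed form:

```python
import numpy as np

def prediction_range(X, y, x_test, k, delta):
    """Exact min/max prediction at x_test over every dataset that
    differs from (X, y) in at most k labels, each by at most delta.
    The least-squares prediction is w @ y with w fixed by X and
    x_test, so the extremes come from perturbing the k labels with
    the largest absolute influence weights."""
    w = x_test @ np.linalg.pinv(X)          # influence of each label
    nominal = w @ y
    slack = np.sort(np.abs(w))[-k:].sum() * delta
    return nominal - slack, nominal + slack

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
lo, hi = prediction_range(X, y, rng.normal(size=3), k=3, delta=1.0)
print(f"prediction lies in [{lo:.3f}, {hi:.3f}] under label errors")
# If this interval crosses a decision threshold, the test sample's
# prediction is affected by dataset multiplicity.
```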